Data handling and curation - from raw to clean data

Aud Halbritter (adapted from Matt Grainger)

Data cycle

Data handling and curation is time consuming

Data cleaning vs Data wrangling

Data cleaning is the process of removing incorrect, duplicate, typos, or otherwise erroneous data from a dataset

Data wrangling changing the format to make it more useful for your analysis

Import data with read_delim

See Import data BioST@TS tutorial

read_delim(file = "data/plant_community_2024.csv")

Data cleaning with the Palmer penguins dataset

Exercise

Run the following commands to install the palmerpenguins package.

#install.packages("palmerpenguins")
library(palmerpenguins)
head(penguins_raw)
# A tibble: 6 × 17
  studyName `Sample Number` Species          Region Island Stage `Individual ID`
  <chr>               <dbl> <chr>            <chr>  <chr>  <chr> <chr>          
1 PAL0708                 1 Adelie Penguin … Anvers Torge… Adul… N1A1           
2 PAL0708                 2 Adelie Penguin … Anvers Torge… Adul… N1A2           
3 PAL0708                 3 Adelie Penguin … Anvers Torge… Adul… N2A1           
4 PAL0708                 4 Adelie Penguin … Anvers Torge… Adul… N2A2           
5 PAL0708                 5 Adelie Penguin … Anvers Torge… Adul… N3A1           
6 PAL0708                 6 Adelie Penguin … Anvers Torge… Adul… N3A2           
# ℹ 10 more variables: `Clutch Completion` <chr>, `Date Egg` <date>,
#   `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
#   `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
#   `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>

Clean names with janitor

penguins_raw |> 
  names() 
 [1] "studyName"           "Sample Number"       "Species"            
 [4] "Region"              "Island"              "Stage"              
 [7] "Individual ID"       "Clutch Completion"   "Date Egg"           
[10] "Culmen Length (mm)"  "Culmen Depth (mm)"   "Flipper Length (mm)"
[13] "Body Mass (g)"       "Sex"                 "Delta 15 N (o/oo)"  
[16] "Delta 13 C (o/oo)"   "Comments"           

Clean names with janitor

penguins_raw |> 
  names() 
 [1] "studyName"           "Sample Number"       "Species"            
 [4] "Region"              "Island"              "Stage"              
 [7] "Individual ID"       "Clutch Completion"   "Date Egg"           
[10] "Culmen Length (mm)"  "Culmen Depth (mm)"   "Flipper Length (mm)"
[13] "Body Mass (g)"       "Sex"                 "Delta 15 N (o/oo)"  
[16] "Delta 13 C (o/oo)"   "Comments"           
penguins_clean <- penguins_raw |> 
  janitor::clean_names(case = "snake") 

penguins_clean |> 
  names()
 [1] "study_name"        "sample_number"     "species"          
 [4] "region"            "island"            "stage"            
 [7] "individual_id"     "clutch_completion" "date_egg"         
[10] "culmen_length_mm"  "culmen_depth_mm"   "flipper_length_mm"
[13] "body_mass_g"       "sex"               "delta_15_n_o_oo"  
[16] "delta_13_c_o_oo"   "comments"         

Get a glimps wiht dplyr

penguins_clean |>
  dplyr::glimpse()
Rows: 344
Columns: 17
$ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
$ sample_number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
$ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ stage             <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult, …
$ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
$ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
$ date_egg          <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, 200…
$ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
$ delta_15_n_o_oo   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18718,…
$ delta_13_c_o_oo   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805, …
$ comments          <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…

Skim your data with skimr

out <- penguins_clean |> 
  skimr::skim() 

out
Data summary
Name penguins_clean
Number of rows 344
Number of columns 17
_______________________
Column type frequency:
character 9
Date 1
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
study_name 0 1.00 7 7 0 3 0
species 0 1.00 33 41 0 3 0
region 0 1.00 6 6 0 1 0
island 0 1.00 5 9 0 3 0
stage 0 1.00 18 18 0 1 0
individual_id 0 1.00 4 6 0 190 0
clutch_completion 0 1.00 2 3 0 2 0
sex 11 0.97 4 6 0 2 0
comments 290 0.16 18 68 0 10 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date_egg 0 1 2007-11-09 2009-12-01 2008-11-09 50

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
sample_number 0 1.00 63.15 40.43 1.00 29.00 58.00 95.25 152.00 ▇▇▆▅▃
culmen_length_mm 2 0.99 43.92 5.46 32.10 39.23 44.45 48.50 59.60 ▃▇▇▆▁
culmen_depth_mm 2 0.99 17.15 1.97 13.10 15.60 17.30 18.70 21.50 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.00 190.00 197.00 213.00 231.00 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.00 3550.00 4050.00 4750.00 6300.00 ▃▇▆▃▂
delta_15_n_o_oo 14 0.96 8.73 0.55 7.63 8.30 8.65 9.17 10.03 ▃▇▆▅▂
delta_13_c_o_oo 13 0.96 -25.69 0.79 -27.02 -26.32 -25.83 -25.06 -23.79 ▆▇▅▅▂

Exercise

Run the following functions:

clean_names()

glimpse()

skimr::skim()

Data validation

library(validate)

rules <- validator(culm = culmen_length_mm >= 0, 
                   body = body_mass_g > 3000,
                   species = is.character(species),
                   sex = sex %in% c("FEMALES", "MALE"))

Data validation

out   <- confront(penguins_clean, rules)
summary(out)
     name items passes fails nNA error warning                     expression
1    culm   344    342     0   2 FALSE   FALSE culmen_length_mm - 0 >= -1e-08
2    body   344    331    11   2 FALSE   FALSE             body_mass_g > 3000
3 species     1      1     0   0 FALSE   FALSE          is.character(species)
4     sex   344    168   165  11 FALSE   FALSE sex %vin% c("FEMALES", "MALE")

Data validation

plot(out)

Rules for data cleaning

  • Deposit and back-up raw data files
  • Do not manually manipulate the raw data files
  • Clean data using code
  • Check the cleaning code

Keep track of changes with tidylog

library(tidylog)
dat <- penguins_clean |> 
  tidylog::select(study_name, species, island, individual_id, body_mass_g) |> 
  tidylog::filter(species %in% c("Adelie Penguin (Pygoscelis adeliae)"))
select: dropped 12 variables (sample_number, region, stage, clutch_completion, date_egg, …)
filter: removed 192 rows (56%), 152 rows remaining

Remove duplicates with dplyr

# A tibble: 4 × 2
  Day       Person
  <chr>     <chr> 
1 Monday    Becks 
2 Tuesday   Amy   
3 Wednesday Matt  
4 Wednesday Matt  

Remove duplicates with dplyr

duplicate_data |> 
  tidylog::distinct()
distinct: removed one row (25%), 3 rows remaining
# A tibble: 3 × 2
  Day       Person
  <chr>     <chr> 
1 Monday    Becks 
2 Tuesday   Amy   
3 Wednesday Matt  

Missing data

Missing data is often a problem, e.g. for running a model. Typically as ecologists we sweep missing data under the carpet by using a “complete case” approach to data analysis.

If you have ever written some code like this:

newdf <- na.omit(penguins_clean)

newdf <- penguins_clean[complete.cases(penguins_clean), ]

newdf <- penguins_clean |>
  tidyr::drop_na()

you are removing missing data (NAs) from your dataset.

Why is this a problem?

By throwing away potentially useful data (only including those rows without a NA in them) you reduce the information you are working with, reduce statistical power and introduce selection bias (invalidating any assumption of randomisation).

Vizualise missing data with naniar

penguins_messy |> 
  naniar::vis_miss()

Different types of missingness

Data can be missing random or systematically.

library(naniar)
penguins_messy |> 
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_miss_point()

Data wrangling with dplyr

Data wrangling with dplyr

Data wrangling with tidyr

Exercise

Use the penguins dataset to:

  • Select the columns species, island and body_mass_g

  • Filter for the species Adelie and the island Biscoe and Dream

  • Convert the penguin body mass from gram to kilogram

  • Calculate the mean bill length and standard error for each penguin species

  • Make a wide dataset showing the body mass for each penguin species in a separate column. And turn it back to a long dataframe.

Reproducible workflows

Reproducible workflows